scene element
BiMa: Towards Biases Mitigation for Text-Video Retrieval via Scene Element Guidance
Le, Huy, Chung, Nhat, Kieu, Tung, Nguyen, Anh, Le, Ngan
Text-video retrieval (TVR) systems often suffer from visual-linguistic biases present in datasets, which cause pre-trained vision-language models to overlook key details. To address this, we propose BiMa, a novel framework designed to mitigate biases in both visual and textual representations. Our approach begins by generating scene elements that characterize each video by identifying relevant entities/objects and activities. For visual debiasing, we integrate these scene elements into the video embeddings, enhancing them to emphasize fine-grained and salient details. For textual debiasing, we introduce a mechanism to disentangle text features into content and bias components, enabling the model to focus on meaningful content while separately handling biased information. Extensive experiments and ablation studies across five major TVR benchmarks (i.e., MSR-VTT, MSVD, LSMDC, ActivityNet, and DiDeMo) demonstrate the competitive performance of BiMa. Additionally, the model's bias mitigation capability is consistently validated by its strong results on out-of-distribution retrieval tasks.
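A minimal sketch of the two debiasing ideas described above, with NumPy stand-ins: the function names, linear heads, and mixing rule are hypothetical illustrations, not BiMa's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def disentangle(text_emb, W_content, W_bias):
    """Textual debiasing sketch: split a text embedding into content and
    bias components via two linear heads (hypothetical stand-ins for the
    paper's disentanglement mechanism)."""
    return text_emb @ W_content, text_emb @ W_bias

def fuse_scene_elements(video_emb, element_embs, alpha=0.5):
    """Visual debiasing sketch: mix the mean scene-element embedding into
    the video embedding to emphasize fine-grained, salient details."""
    return (1 - alpha) * video_emb + alpha * element_embs.mean(axis=0)

d = 8
text_emb = rng.normal(size=d)
W_content, W_bias = rng.normal(size=(d, d)), rng.normal(size=(d, d))
content, bias = disentangle(text_emb, W_content, W_bias)

video_emb = rng.normal(size=d)
element_embs = rng.normal(size=(3, d))  # e.g. embeddings of detected entities/activities
debiased_video = fuse_scene_elements(video_emb, element_embs)
# Retrieval would then score debiased_video against the content component only,
# handling the bias component separately.
```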
Physically Plausible 3D Human-Scene Reconstruction from Monocular RGB Image using an Adversarial Learning Approach
Biswas, Sandika, Li, Kejie, Banerjee, Biplab, Chaudhuri, Subhasis, Rezatofighi, Hamid
Holistic 3D human-scene reconstruction is a crucial and emerging research area in robot perception. A key challenge in holistic 3D human-scene reconstruction is to generate a physically plausible 3D scene from a single monocular RGB image. Existing research mainly proposes optimization-based approaches for reconstructing the scene from a sequence of RGB frames with explicitly defined physical laws and constraints between different scene elements (humans and objects). However, it is hard to explicitly define and model every physical law in every scenario. This paper proposes using an implicit feature representation of the scene elements to distinguish a physically plausible alignment of humans and objects from an implausible one. We propose using a graph-based holistic representation with an encoded physical representation of the scene to analyze the human-object and object-object interactions within the scene. Using this graphical representation, we adversarially train our model to learn the feasible alignments of the scene elements from the training data itself, without explicitly defining the laws and constraints between them. Unlike existing inference-time optimization-based approaches, we use this adversarially trained model to produce a per-frame 3D reconstruction of the scene that abides by the physical laws and constraints. Our learning-based method achieves comparable 3D reconstruction quality to existing optimization-based holistic human-scene reconstruction methods without requiring inference-time optimization, which makes it better suited than existing methods for potential use in robotic applications such as robot navigation.
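A toy sketch of the discriminator side of this idea: encode pairwise relations between scene elements, then score how feasible the alignment looks. The feature encoding and the tiny MLP below are invented for illustration (weights are random here; in the adversarial setup they would be trained against implausible configurations).

```python
import numpy as np

rng = np.random.default_rng(1)

def pairwise_features(nodes):
    """Encode each human-object / object-object pair by relative position
    and distance (a stand-in for the paper's encoded physical representation
    on the scene graph)."""
    feats = []
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            diff = nodes[i] - nodes[j]
            feats.append(np.concatenate([diff, [np.linalg.norm(diff)]]))
    return np.stack(feats)

def plausibility_score(feats, W, v):
    """Toy discriminator: one-hidden-layer MLP averaging a feasibility
    score over all scene-element pairs."""
    h = np.tanh(feats @ W)
    return float(np.mean(h @ v))

nodes = rng.normal(size=(4, 3))  # 3-D positions of 4 scene elements
feats = pairwise_features(nodes)
W = rng.normal(size=(4, 8))
v = rng.normal(size=8)
score = plausibility_score(feats, W, v)
```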
SyntEO: Synthetic Dataset Generation for Earth Observation with Deep Learning -- Demonstrated for Offshore Wind Farm Detection
Hoeser, Thorsten, Kuenzer, Claudia
With the emergence of deep learning in recent years, new opportunities arose in Earth observation research. Nevertheless, they also brought new challenges. The data-hungry training processes of deep learning models demand large, resource-expensive annotated datasets and have partly replaced knowledge-driven approaches, so that model behaviour and the final prediction process became a black box. The proposed SyntEO approach enables Earth observation researchers to automatically generate large deep-learning-ready datasets and thus free up otherwise occupied resources. SyntEO does this by including expert knowledge in the data generation process in a highly structured manner. In this way, fully controllable experiment environments are set up, which support insights into the model training. Thus, SyntEO makes the learning process approachable and model behaviour interpretable, an important cornerstone for explainable machine learning. We demonstrate the SyntEO approach by predicting offshore wind farms in Sentinel-1 images over two of the world's largest offshore wind energy production sites. The largest generated dataset has 90,000 training examples. A basic convolutional neural network for object detection, trained only on this synthetic data, confidently detects offshore wind farms while minimising false detections in challenging environments. In addition, four sequential datasets are generated, demonstrating how the SyntEO approach can precisely define the dataset structure and influence the training process. SyntEO is thus a hybrid approach that creates an interface between expert knowledge and data-driven image analysis.
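The core idea — expert knowledge encoded as structured generation rules that emit images together with ready-made annotations — can be sketched as follows. The layout rule, image size, and target rendering below are invented toy choices, not SyntEO's actual generation pipeline.

```python
import numpy as np

rng = np.random.default_rng(42)

def synthesize_scene(size=128, n_turbines=5, spacing=20):
    """Toy synthetic Earth-observation sample in the spirit of SyntEO:
    expert knowledge is encoded as a placement rule (turbines sit on a
    regular line, mimicking real wind-farm layouts), the background
    mimics sea clutter, and each sample comes with bounding-box
    annotations for free."""
    image = rng.normal(0.0, 0.1, size=(size, size))  # sea-clutter background
    boxes = []
    for i in range(n_turbines):
        x = 10 + i * spacing
        y = 10 + i * spacing
        image[y - 2:y + 3, x - 2:x + 3] += 1.0       # bright point target
        boxes.append((x - 2, y - 2, x + 2, y + 2))   # annotation generated with the image
    return image, boxes

image, boxes = synthesize_scene()
```

Because every dataset property is a parameter of the generator, sequential datasets with precisely controlled structure (as in the four-dataset experiment above) fall out naturally.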
A Knowledge-based Approach for the Automatic Construction of Skill Graphs for Online Monitoring
Jatzkowski, Inga, Menzel, Till, Maurer, Markus
Automated vehicles need to be aware of the capabilities they currently possess. Skill graphs are directed acyclic graphs in which a vehicle's capabilities and the dependencies between these capabilities are modeled. The skills a vehicle requires depend on the behaviors the vehicle has to perform and on the operational design domain (ODD) of the vehicle. Skill graphs were originally proposed for online monitoring of the current capabilities of an automated vehicle. They have also been shown to be useful during other parts of the development process, e.g., system design and system verification. Skill graph construction is an iterative, expert-based, manual process with little to no guidelines. This process is, thus, prone to errors and inconsistencies, especially regarding the propagation of changes in the vehicle's intended ODD into the skill graphs. In order to circumvent this problem, we propose to formalize expert knowledge regarding skill graph construction into a knowledge base and to automate the construction process. Thus, all changes in the vehicle's ODD are reflected in the skill graphs automatically, leading to a reduction in inconsistencies and errors in the constructed skill graphs.
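A minimal illustration of automatic skill-graph construction from a knowledge base: required behaviors are expanded transitively into a directed acyclic graph of skills. The knowledge-base entries below are invented examples, not the paper's actual skill taxonomy.

```python
# Toy knowledge base mapping each skill/behavior to the skills it depends on.
KNOWLEDGE_BASE = {
    "lane keeping": ["lateral control", "lane detection"],
    "lateral control": ["steering actuation"],
    "lane detection": ["camera perception"],
}

def build_skill_graph(behaviors, kb):
    """Expand the required behaviors into (nodes, edges) of a skill DAG
    by transitively following dependencies in the knowledge base."""
    nodes, edges = set(), set()
    stack = list(behaviors)
    while stack:
        skill = stack.pop()
        if skill in nodes:
            continue
        nodes.add(skill)
        for dep in kb.get(skill, []):
            edges.add((skill, dep))  # edge: skill depends on dep
            stack.append(dep)
    return nodes, edges

skills, edges = build_skill_graph(["lane keeping"], KNOWLEDGE_BASE)
```

Changing the ODD then means changing the behavior list or the knowledge base and regenerating the graph, rather than editing graphs by hand.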
A Scenario-Based Development Framework for Autonomous Driving
We systematically analyzed previous research and propose a definition of the scenario concept, the elements of the scenario ontology, the data sources for scenarios, the processing methods for scenario data, and a scenario-based V-Model. Moreover, we summarize automated test-scenario construction methods based on random scenario generation and dangerous scenario generation.
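The two construction methods can be sketched by sampling from a scenario ontology; the ontology elements and the "dangerous" heuristic below are invented for illustration, not the paper's actual ontology.

```python
import random

random.seed(0)

# Toy scenario ontology (element names and values invented for illustration).
ONTOLOGY = {
    "road": ["highway", "urban street", "roundabout"],
    "weather": ["clear", "rain", "fog"],
    "actor": ["pedestrian", "cyclist", "truck"],
}

def random_scenario(ontology):
    """Random scenario generation: sample one value per ontology element."""
    return {element: random.choice(values) for element, values in ontology.items()}

def dangerous_scenario(ontology):
    """Dangerous scenario generation: bias the sample toward adverse
    conditions (a simple hand-coded heuristic here)."""
    scenario = random_scenario(ontology)
    scenario["weather"] = "fog"        # force low visibility
    scenario["actor"] = "pedestrian"   # vulnerable road user
    return scenario

s = random_scenario(ONTOLOGY)
d = dangerous_scenario(ONTOLOGY)
```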
MONet: Unsupervised Scene Decomposition and Representation
Burgess, Christopher P., Matthey, Loic, Watters, Nicholas, Kabra, Rishabh, Higgins, Irina, Botvinick, Matt, Lerchner, Alexander
Realistic visual scenes contain rich structure, which humans effortlessly exploit to reason effectively and intelligently. In particular, object perception, the ability to perceive and represent individual objects, is considered a fundamental cognitive ability that allows us to understand - and efficiently interact with - the world as perceived through our senses [Johnson, 2018, Green and Quilty-Dunn, 2017]. However, despite recent breakthroughs in computer vision fuelled by advances in deep learning, learning to represent realistic visual scenes in terms of objects remains an open challenge for artificial systems. The impact and application of robust visual object decomposition would be far-reaching. Models such as graph-structured networks that rely on handcrafted object representations have recently achieved remarkable results in a wide range of research areas, including reinforcement learning, physical modeling, and multi-agent control [Battaglia et al., 2018, Wang et al., 2018, Hamrick et al., 2017, Hoshen, 2017]. The prospect of acquiring visual object representations through unsupervised learning could be invaluable for extending the generality and applicability of such models.
A POMDP Model of Eye-Hand Coordination
Erez, Tom (Washington University in St. Louis) | Tramper, Julian J. (Radboud University) | Smart, William D (Washington University in St. Louis) | Gielen, Stan CAM (Radboud University)
This paper presents a generative model of eye-hand coordination. We use numerical optimization to solve for the joint behavior of an eye and two hands, deriving a predicted motion pattern from first principles, without imposing heuristics. We model the planar scene as a POMDP with 17 continuous state dimensions. Belief-space optimization is facilitated by using a nominal-belief heuristic, whereby we assume (during planning) that the maximum likelihood observation is always obtained. Since a globally-optimal solution for such a high-dimensional domain is computationally intractable, we employ local optimization in the belief domain. By solving for a locally-optimal plan through belief space, we generate a motion pattern of mutual coordination between hands and eye: the eye's saccades disambiguate the scene in a task-relevant manner, and the hands' motions anticipate the eye's saccades. Finally, the model is validated through a behavioral experiment, in which human subjects perform the same eye-hand coordination task. We show that the simulation is congruent with the experimental results.
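The nominal-belief (maximum-likelihood observation) heuristic can be illustrated with a 1-D Gaussian belief: assuming the observation always equals the predicted mean makes the innovation zero, so during planning the belief mean evolves deterministically while the variance follows the usual Kalman update. The scalar dynamics and noise values below are illustrative, not the paper's 17-dimensional model.

```python
def belief_rollout_mlo(mu, sigma2, a, q, r, steps):
    """Roll out a 1-D Gaussian belief (mean mu, variance sigma2) under the
    maximum-likelihood-observation assumption. Dynamics: x' = a*x + noise
    with variance q; observation: y = x + noise with variance r."""
    means, variances = [mu], [sigma2]
    for _ in range(steps):
        mu = a * mu                   # predict mean
        sigma2 = a * a * sigma2 + q   # predict variance
        k = sigma2 / (sigma2 + r)     # Kalman gain
        # ML observation y == predicted mean -> innovation is zero, so the
        # mean is unchanged and only the uncertainty shrinks.
        sigma2 = (1 - k) * sigma2
        means.append(mu)
        variances.append(sigma2)
    return means, variances

means, variances = belief_rollout_mlo(mu=1.0, sigma2=1.0, a=1.0, q=0.0, r=1.0, steps=3)
```

This determinism is what makes local trajectory optimization through belief space tractable: the planner optimizes over one nominal belief trajectory instead of a tree of possible observation outcomes.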